In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
Read the data; fill in the path to the data file below.
In [2]:
damd = pd.read_csv("", index_col="tweet_id")
damd['hashtags'] = damd['hashtags'].astype(str)
damd.head(3)
Out[2]:
First, build an inverted index from hashtags to tweets, as a Python dict
of the shape:
{'hashtag1': [tweet_1, tweet_2, ...],
'hashtag2': [tweet_2, tweet_3, ...],
⋮
}
For this, we extract the hashtags
column, each entry of which is itself a semicolon-separated list of varying length.
In [3]:
hashtags = {}
# Split each semicolon-separated string into a list, then walk (tweet_id, list) pairs
for (tweet, hashtagsinthistweet) in damd['hashtags'].map(lambda l: l.split(';')).items():
    for hashtag in hashtagsinthistweet:
        hashtag = hashtag.lower()  # ignore case
        if hashtag not in hashtags:
            hashtags[hashtag] = [tweet]
        else:
            hashtags[hashtag].append(tweet)
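To check the logic on a small case, here is a minimal sketch of the same kind of index built with only the standard library. The tweet ids and hashtag strings in `toy_tweets` and the helper name `build_inverted_index` are made up for illustration; they are not taken from the data set.

```python
from collections import defaultdict

# Hypothetical toy data: tweet id -> semicolon-separated hashtags
toy_tweets = {
    101: "DAMD;Python",
    102: "damd;pandas;python",
    103: "damd",
}

def build_inverted_index(tweets):
    """Map each lowercased hashtag to the list of tweet ids that use it."""
    index = defaultdict(list)
    for tweet_id, raw in tweets.items():
        for hashtag in raw.split(';'):
            index[hashtag.lower()].append(tweet_id)
    return dict(index)

toy_index = build_inverted_index(toy_tweets)
print(toy_index)
```

Using `defaultdict(list)` removes the need for the explicit membership test, at the cost of one extra import.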
We can remove the damd
hashtag, since in our data every single item has it, and thus it carries no information.
In [4]:
hashtags.pop('damd'); # semicolon at the end of the line suppresses output
Next, build a pandas.Series
of the number of tweets each hashtag is used in, from the inverted index above.
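A minimal sketch of this step on a toy index, assuming the index is a plain dict of lists. The hashtags and tweet ids in `toy_index` are hypothetical.

```python
import pandas as pd

# Hypothetical toy inverted index: hashtag -> tweet ids
toy_index = {
    "python": [101, 102],
    "pandas": [102],
    "dataviz": [101, 103, 104],
}

# One count per hashtag; the dict keys become the Series index
toy_counts = pd.Series({h: len(ids) for h, ids in toy_index.items()}, name="count")
print(toy_counts.sort_values(ascending=False))
```

Passing a dict to `pd.Series` keeps each hashtag and its count together, so the values and the index cannot drift out of step.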
In [5]:
counts = pd.Series([len(hashtags[h]) for h in hashtags.keys()], name="count", index=hashtags.keys())
counts.describe()
Out[5]:
OK, that looks like a long-tailed distribution. Let's look at a 10-bin histogram.
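As a rough sketch of what a 10-bin histogram computes, the function below splits the range between the smallest and largest value into equal-width bins and counts how many values land in each. The function name `equal_width_histogram` and the numbers in `toy_counts` are made up for illustration, and the sketch assumes the values are not all identical.

```python
def equal_width_histogram(values, bins=10):
    """Count values per equal-width bin between min(values) and max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins  # assumes hi > lo
    freq = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin (right edge inclusive)
        i = min(int((v - lo) / width), bins - 1)
        freq[i] += 1
    return freq

# Hypothetical long-tailed counts: many rare hashtags, a few frequent ones
toy_counts = [1] * 12 + [2] * 5 + [3] * 2 + [7, 21]
print(equal_width_histogram(toy_counts))
```

With long-tailed data most values pile up in the first bin, which is why the histogram of the real counts is dominated by its leftmost bar.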
In [6]:
counts.hist()
Out[6]:
Let's inspect the topmost hashtags, say those which occur more than 5 times.
In [7]:
ax = counts.loc[counts > 5].sort_values().plot.barh(grid=True, figsize=(5, 15), title="Hashtag occurrences, ignored case (count > 5)")
# Iterating a Series yields its values, so each bar is labelled with its count
for (patch, count) in zip(ax.patches, counts.loc[counts > 5].sort_values()):
    ax.annotate(str(count), (patch.get_width() + 5, patch.get_y()))
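The selection behind the plot can be checked on a toy Series. This is a sketch with hypothetical hashtags and counts; `threshold` is a name introduced here, and `nlargest` is shown only as an alternative way to pick the top entries by rank instead of by a fixed cutoff.

```python
import pandas as pd

# Hypothetical hashtag counts
toy_counts = pd.Series(
    {"python": 12, "pandas": 9, "dataviz": 7, "rstats": 2, "julia": 5},
    name="count",
)

threshold = 5
# Boolean mask keeps counts strictly above the threshold; sort ascending for barh,
# which draws the first row at the bottom
top = toy_counts[toy_counts > threshold].sort_values()

# Alternative: the three most frequent hashtags, largest first
top3 = toy_counts.nlargest(3)
print(top)
print(top3)
```

Note that the comparison is strict, so a hashtag with exactly 5 occurrences is excluded.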